Chapter 16 Written Questions

“How does it work?” Section

Question 1

  1. Calculating Zac's mean:
zac <- (3+5+7)/3
zac
## [1] 5

Calculating Skylar's mean:

skylar <- (2+6+10)/3
skylar
## [1] 6

The between variation is 6 - 5 = 1.

  2. Finding the within variation:

Zac, 2012: 3 - 5 = -2; Zac, 2013: 7 - 5 = 2; Zac, 2014: 5 - 5 = 0

Skylar, 2012: 2 - 6 = -4; Skylar, 2013: 6 - 6 = 0; Skylar, 2014: 10 - 6 = 4

  3. The variation that actually gets used to estimate the coefficient leans more heavily on the individual with more variation over time. Our fixed effects estimate will therefore be closer to Skylar's treatment effect, which means it will be closer to 2.
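As a small sketch of this logic in base R (toy data straight from Question 1), demeaning within person shows who supplies more within variation:

```r
# The six observations from Question 1
d <- data.frame(
  person = rep(c('Zac', 'Skylar'), each = 3),
  year   = rep(2012:2014, 2),
  x      = c(3, 7, 5, 2, 6, 10)
)

# Demean within person to isolate the within variation
# (ave() returns each person's mean, repeated for their rows)
d$x_demean <- d$x - ave(d$x, d$person)

# Skylar's demeaned values (-4, 0, 4) spread more than Zac's
# (-2, 2, 0): Skylar has four times Zac's within variance
tapply(d$x_demean, d$person, var)
```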

Question 2

  1. Causal diagram below (figure not reproduced here).

  2. All paths will be closed by including city fixed effects.

Question 3

  1. within

  2. within

  3. between

  4. between

  5. combination

  6. between

Question 4

The process of taking each observation relative to its individual-level mean has the effect of controlling for the individual because once we do this process, we’re left with the way we vary relative to our own averages. This allows us to isolate within variation, effectively ‘controlling for individual’ differences.
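A quick sketch of this equivalence with made-up toy data (names hypothetical): demeaning within individual produces exactly the same slope as including individual dummies.

```r
set.seed(1)
toy <- data.frame(id = rep(1:3, each = 4), x = rnorm(12))
# Each individual gets a different baseline level of y
toy$y <- 2 * toy$x + rep(c(0, 5, 10), each = 4) + rnorm(12)

# Demean x and y within individual (ave() gives within-group means)
toy$x_dm <- toy$x - ave(toy$x, toy$id)
toy$y_dm <- toy$y - ave(toy$y, toy$id)

b_demean <- unname(coef(lm(y_dm ~ x_dm, data = toy))['x_dm'])
b_dummy  <- unname(coef(lm(y ~ x + factor(id), data = toy))['x'])

# The two slopes are identical: demeaning 'controls for individual'
all.equal(b_demean, b_dummy)  # TRUE
```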

“How is it performed?” Section

Question 1

For a given city, in a year where the number of cultural events is one unit higher than is typical for that city, we'd expect the level of trust to be 3.6 units higher than is typical for that city.

Question 2

For a given city in a given year, if the number of cultural events is one unit higher than we'd expect given what's typical for that city and what's typical for that year, then we'd expect the level of trust to be 2.4 units higher than we'd expect given what's typical for that city and that year.

Question 3

Controlling for individual and time fixed effects lets you isolate variation relative to what is typical for a given individual and for a given year. In contrast, adding a linear/polynomial/etc. term for time to the regression (alongside individual fixed effects) only adjusts for a smooth overall trend: it gives you an estimate for a given person at a given point in the trend, but it doesn't absorb year-specific shocks, so it doesn't tell you how the outcome varies relative to what's expected for that person in that particular year. In other words, individual and time fixed effects isolate variation from the 'typical' case for that person and that year, rather than just adjusting for a modeled trend over time.
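To illustrate the difference with a base-R sketch on made-up data (all names hypothetical): year as a categorical variable absorbs any year-specific shock, while a linear year term only removes straight-line drift.

```r
set.seed(2)
toy <- data.frame(id = rep(1:20, each = 5), year = rep(2001:2005, 20))
toy$x <- rnorm(100)
# The common time shock here is nonlinear (a one-year spike),
# which a linear trend cannot fully absorb
toy$y <- 2 * toy$x + toy$id + 5 * (toy$year == 2003) + rnorm(100)

# Year dummies: soak up whatever happened in each specific year
twfe  <- lm(y ~ x + factor(id) + factor(year), data = toy)
# Linear year term: only adjusts for a smooth drift over time
trend <- lm(y ~ x + factor(id) + year, data = toy)

c(twfe = unname(coef(twfe)['x']), trend = unname(coef(trend)['x']))
```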

Question 4

(B) Because random effects allow some amount of between variation into the model, and some of the real individual effect lives in that between variation.

Coding Exercises

Follow the below instructions and turn in both your code and results:

Question 1

Load the mathpnl.csv data file provided (in R or Python store it as mp), which comes from Leslie Papke and consists of data at the school district level, and was featured in the Wooldridge (2010) textbook.

We are only going to be working with a few variables. Limit the data to these variables:

library(tidyverse)

mp <- read.csv('https://raw.githubusercontent.com/NickCH-K/TheEffectAssignments/main/mathpnl.csv')
mp <- mp |> select(c('distid', 'year', 'math4', 'expp', 'lunch'))

Question 2

Panel data is often described as “N by T”. That is, the number of different individuals N and the number of time periods T. Write code that outputs what N and T are in this data.

Language-specific instructions:

df <- unique(mp['distid'])
nrow(df)
## [1] 550

There are N = 550 unique individuals (districts).

nrow(unique(mp['year']))
## [1] 7

There are T = 7 years.

Question 3

A balanced panel is one in which each individual shows up in every single time period. You can check whether a data set is a balanced panel by seeing whether the number of unique time periods each individual ID shows up in is the same as the number of unique time periods, or whether the number of unique individual IDs in each time period is the same as the total number of unique individual IDs. Think to yourself a second about why these procedures would check that this is a balanced panel. Then, check whether this data set is a balanced panel.

(hint: there are many ways to do this, but the easiest is to limit the data to just individual ID and year, drop any duplicates (keeping only unique() values in R, drop duplicates in Stata, or .drop_duplicates() in Python), and tabulating how many times each year appears (table() in R, tabulate in Stata, .value_counts() in Python))

mp_1 <- mp |> select(c('distid', 'year'))
df <- unique(mp_1)
nrow(df)
## [1] 3850

This data is balanced: there are 3850 unique district/year pairs, which equals the total number of rows in the data and also equals N × T = 550 × 7.
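The hint's table() approach can also be wrapped into a single logical check. A minimal sketch on a hypothetical two-district mini-panel:

```r
# Hypothetical mini-panel: two districts observed in three years
toy <- data.frame(distid = rep(c(101, 102), each = 3),
                  year   = rep(2012:2014, 2))

# Balanced if every district shows up once in every year, i.e. each
# district's row count equals the number of unique years
is_balanced <- all(table(toy$distid) == length(unique(toy$year)))
is_balanced  # TRUE
```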

Question 4

Run an OLS regression, with no fixed effects, of math4 on expp and lunch. Store the results as m1. Language-specific instructions: - Yes, store them in Stata too. Following your regression use estimates store.

library(tidyverse)
library(modelsummary)
m1 <- lm(math4 ~ expp + lunch, data = mp)
msummary(m1)
Model 1
(Intercept) 29.527
(1.139)
expp 0.007
(0.000)
lunch −0.381
(0.016)
Num.Obs. 3850
R2 0.308
R2 Adj. 0.307
AIC 31856.6
BIC 31881.6
Log.Lik. −15924.294
F 855.419

Question 5

Modify the model in step 4 to include fixed effects for distid “by hand”. That is, subtract out the within-distid mean of math4, expp, and lunch, creating new variables math4_demean, expp_demean, and lunch_demean, and re-estimate the model using those variables, storing the result as m2. (tip: be careful that your code doesn’t overwrite the original variables, or at least if it does, reload the data afterwards)

mp <- mp |>
  group_by(distid) |>
  mutate(math4_demean = math4 - mean(math4),
         expp_demean = expp - mean(expp),
         lunch_demean = lunch - mean(lunch)) |>
  ungroup()
m2 <- lm(math4_demean ~ expp_demean + lunch_demean, data = mp)
msummary(m2)

Note the group_by(distid): the subtraction must use the within-distid mean, not the overall mean. Because within-district demeaning isolates the same within variation as district fixed effects, the slopes here should match the fixed effects estimates from feols() in Question 7.

Question 6

Next we’re going to estimate fixed effects by including distid as a set of dummies. This can be extremely slow, so for demonstration purposes use only the first 500 observations of your data (don’t get rid of the other observations, though, you’ll want them for the rest of this assignment). Run the model from step 4 but with dummies for different values of distid, saving the result as m3. Then, do a joint F test on the dummies (see Chapter 13), and report if you can reject that the dummies are jointly zero at the 99% level. Tip: distid is stored as a numeric variable, but you want it to be treated as a categorical variable. If your regression result only has one coefficient for distid you’ve done it wrong.

mp_2 <- head(mp, 500)
mp_2 <- mp_2 |> mutate(distid_factor=factor(distid))



# Running the model. lm() with a factor variable creates one dummy
# per district (glm() with the default gaussian family would give
# identical estimates)
m3 <- lm(math4 ~ expp + lunch + distid_factor, data = mp_2)
msummary(m3)
Model 1
(Intercept) −29.659
(8.785)
expp 0.009
(0.001)
lunch 0.801
(0.193)
distid_factor2010 −12.888
(7.049)
distid_factor2020 −38.520
(7.497)
distid_factor2070 30.609
(7.341)
distid_factor2080 −1.467
(6.897)
distid_factor3000 31.446
(7.765)
distid_factor3010 32.052
(8.353)
distid_factor3020 38.661
(7.806)
distid_factor3030 26.283
(7.612)
distid_factor3040 34.579
(7.894)
distid_factor3050 1.243
(7.054)
distid_factor3060 15.596
(7.255)
distid_factor3070 25.289
(7.809)
distid_factor3080 11.404
(7.940)
distid_factor3100 38.535
(8.727)
distid_factor3440 48.652
(10.265)
distid_factor4000 14.641
(6.959)
distid_factor4010 18.442
(7.056)
distid_factor5010 −34.162
(7.011)
distid_factor5035 11.512
(7.068)
distid_factor5040 36.178
(7.662)
distid_factor5060 33.887
(7.620)
distid_factor5065 12.201
(7.100)
distid_factor5070 0.939
(6.902)
distid_factor6010 −15.379
(7.040)
distid_factor6020 16.476
(7.003)
distid_factor6050 9.037
(6.970)
distid_factor7010 −12.455
(7.673)
distid_factor7040 −1.222
(6.910)
distid_factor8010 24.996
(7.211)
distid_factor8030 31.327
(7.795)
distid_factor8050 27.043
(8.199)
distid_factor9010 10.542
(6.987)
distid_factor9030 32.609
(7.480)
distid_factor9050 28.589
(8.424)
distid_factor9090 15.756
(7.087)
distid_factor10015 13.916
(6.936)
distid_factor10025 14.419
(7.038)
distid_factor11000 21.332
(6.962)
distid_factor11010 −50.128
(10.550)
distid_factor11020 44.821
(9.261)
distid_factor11030 44.543
(8.731)
distid_factor11033 31.772
(7.804)
distid_factor11160 10.894
(7.669)
distid_factor11200 3.076
(7.334)
distid_factor11210 37.262
(7.539)
distid_factor11240 11.121
(6.954)
distid_factor11250 −3.952
(7.402)
distid_factor11300 14.490
(6.916)
distid_factor11310 28.201
(7.042)
distid_factor11320 15.580
(7.052)
distid_factor11330 2.584
(6.936)
distid_factor11340 8.639
(8.936)
distid_factor11670 30.838
(10.190)
distid_factor12010 16.742
(7.678)
distid_factor12020 7.010
(7.054)
distid_factor12040 13.320
(7.532)
distid_factor13000 20.026
(7.034)
distid_factor13010 −24.651
(7.564)
distid_factor13020 −29.100
(7.337)
distid_factor13050 18.361
(7.514)
distid_factor13070 34.718
(7.900)
distid_factor13080 25.202
(7.173)
distid_factor13090 31.129
(8.858)
distid_factor13095 12.237
(7.067)
distid_factor13110 35.665
(8.320)
distid_factor13120 29.180
(8.210)
distid_factor13130 16.800
(6.972)
distid_factor13135 21.951
(7.180)
distid_factor14010 2.740
(6.954)
distid_factor14020 17.068
(6.923)
distid_factor14030 15.137
(9.378)
Num.Obs. 500
AIC 4045.8
BIC 4361.9
Log.Lik. −1947.924

Performing joint F test

library(car)
# Test only the district dummy coefficients, not the intercept or the
# other predictors: matchCoefs() from car selects every coefficient
# whose name contains 'distid_factor'
linearHypothesis(m3, matchCoefs(m3, 'distid_factor'), test = "F")

The joint F test on the district dummies is significant well below the 1% level, so we can reject the hypothesis that the dummies are jointly zero at the 99% level.

Question 7

Now we will use a specially-designed function to estimate a model with fixed effects. (Using the whole data set once again), use feols() from the fixest package in R, reghdfe from the reghdfe package in Stata, or PanelOLS from linearmodels in Python to estimate the model from step 4 but with fixed effects for distid. Save the result as m4. Include standard errors clustered at the distid level.

library(fixest)
m4 <- feols(math4 ~  expp + lunch | distid,
             data = mp)

# Note that standard errors will be clustered by the first 
# fixed effect by default. Set se = 'standard' to not do this
msummary(m4, stars = c('*' = .1, '**' = .05, '***' = .01))
Model 1
expp 0.012***
(0.000)
lunch 0.314***
(0.094)
Num.Obs. 3850
R2 0.680
R2 Adj. 0.626
R2 Within 0.499
R2 Pseudo
AIC 29984.4
BIC 33437.6
Log.Lik. −14440.199
Std.Errors by: distid
FE: distid X
* p < 0.1, ** p < 0.05, *** p < 0.01

Question 8

Now add fixed effects for year to your model from step 7 to create a two-way fixed effects model. Keep the standard errors clustered at the distid level. Save the results as m5.

m5 <- feols(math4 ~ expp + lunch | distid + year,
             data = mp)

# Standard errors are still clustered by the first
# fixed effect (distid) by default
msummary(m5, stars = c('*' = .1, '**' = .05, '***' = .01))

Question 9

Using modelsummary() from modelsummary in R, esttab from estout in Stata, or Stargazer from stargazer.stargazer in Python, make a regression table including m1 through m5 in the same table so you can compare them all. Read the documentation of your command to figure out how to include the expp, lunch, expp_demean, and lunch_demean predictors in the table without clogging the thing up with a bunch of dummy coefficients from m3.

Write down two interesting things you notice from the table. Multiple possible answers here.

models <- list(m1, m2, m3, m4, m5)
cm <- c('math4' = 'math4',
        'expp' = 'expp',
         'lunch' = 'lunch',
         'expp_demean' = 'expp_demean',
         'lunch_demean' = 'lunch_demean',
         'math4_demean' = 'math4_demean',
        '(Intercept)' = 'Constant')
msummary(models, coef_map=cm)
              Model 1     Model 2     Model 3     Model 4     Model 5
expp            0.007                   0.009       0.012       0.012
               (0.000)                 (0.001)     (0.000)     (0.000)
lunch          -0.381                   0.801       0.314       0.314
               (0.016)                 (0.193)     (0.094)     (0.094)
expp_demean                 0.007
                           (0.000)
lunch_demean               -0.381
                           (0.016)
Constant       29.527       0.000     -29.659
               (1.139)     (0.244)     (8.785)
Num.Obs.         3850        3850         500        3850        3850
R2              0.308       0.308                   0.680       0.680
R2 Adj.         0.307       0.307                   0.626       0.626
R2 Within                                           0.499       0.499
AIC           31856.6     31856.6      4045.8     29984.4     29984.4
BIC           31881.6     31881.6      4361.9     33437.6     33437.6
Log.Lik.   -15924.294  -15924.294   -1947.924  -14440.199  -14440.199
F             855.419     855.419
Std.Errors                                     by: distid  by: distid
FE: distid                                              X           X

It is interesting that although the coefficient on 'expp' stays fairly stable (around 0.01) across the models, the coefficient on 'lunch' varies much more, even flipping sign. Also notable is that the AIC and BIC for model 3 are far lower than for the other models, but that mostly reflects the fact that model 3 was fit on only the first 500 observations: information criteria are only comparable across models estimated on the same data, so this does not mean model 3 fits best.

Question 10

Finally, we'll close it out by using correlated random effects instead of fixed effects (see 16.3.3). You already have expp_demean and lunch_demean from earlier. Now, modify the code from that slightly to add on expp_mean and lunch_mean (the mean within distid instead of the value minus that mean). Then, regress math4 on expp_demean, lunch_demean, expp_mean, and lunch_mean, with random effects for distid using lmer() from lme4 in R, xtreg, re in Stata, or RandomEffects from linearmodels in Python. Show a summary of the regression results.

Language-specific instructions: - In R, lmer() has a hard time with numeric variables as categorical indicators. Create a new, factor version of distid directly in the data before running the model, and use that instead.

library(lme4)
mp <- mp |>
  group_by(distid) |>
  mutate(expp_mean = mean(expp), lunch_mean = mean(lunch)) |>
  ungroup() |>
  mutate(factor_distid = as.factor(distid))
m6 <- lmer(math4 ~ expp_demean + lunch_demean + expp_mean + lunch_mean +
             (1 | factor_distid), data = mp)
msummary(m6)

Note the two fixes here: the means must be computed within distid (hence the group_by()), and in lmer() the predictors enter as ordinary fixed terms, with the random intercept for district specified separately as (1 | factor_distid).